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ABSTRACT 

We introduce a model of DNA sequence evolution which 
can account for biases in mutation rates that depend on the 
identity of the neighboring bases. An analytic solution for 
this class of models is developed by adopting well-known 
methods of nonlinear dynamics. Results are presented for 
the CpG-methylation-deamination process which dominates 
point substitutions in vertebrates. The dinucleotide fre- 
quencies generated by the model (using empirically obtained 
mutation rates) match the overall pattern observed in non- 
coding DNA. A web-based tool has been constructed to 
compute single- and dinucleotide frequencies for arbitrary 
neighbor-dependent mutation rates. Also provided is the 
backward procedure to infer the mutation rates using maxi- 
mum likelihood analysis given the observed single- and din- 
ucleotide frequencies. Reasonable estimates of the mutation 
rates can be obtained very efficiently, using generic non- 
coding DNA sequences as input, after masking out long 
homonucleotide subsequences. Our method is much more 
convenient and versatile to use than the traditional method 
of deducing mutation rates by counting mutation events in 
carefully chosen sequences. More generally, our approach 
provides a more realistic but still tractable description of 
non-coding genomic DNA, and may be used as a null model 
for various sequence analysis applications. 

Categories and Subject Descriptors 
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— Biology and Genetics; G.l.m [Mathematics of Com- 
puting]: Numerical Analysis — Miscellaneous; 1.6.5 [Com- 
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1. INTRODUCTION 

Models of DNA sequence evolution have generally treated 
sequences as collections of independently evolving sites 8 . 
However, there is a substantial amount of evidence from se- 
quence analysis and studies of mutation rates|5] which sug- 
gests that the identities of neighboring bases can have a 
strong influence on the types and rates of mutational events 
which occur at a given sequence position. We briefly review 
these types of evidence before describing a model which ex- 
plicitly accounts for neighbor-dependent mutations. 

Biochemical studies in the early 1970s compared the pattern 
of dinucleotide odds ratios (dinucleotide frequencies normal- 
ized for the base composition) or "general designs" of dif- 
ferent genomes and different fractions of genomic DNArT71 
I18| and concluded that this pattern is a remarkably stable 
property of a genome which is largely preserved in closely 
related genomes or in different renaturation rate fractions of 
the same genome, for example. Some two decades later, a 
significant body of sequence analysis work primarily by Kar- 
lin and coworkers! 101 1111 112| has elaborated and expanded 
on these observations, showing that the pattern of dinu- 
cleotide relative abundance values (essentially equivalent to 
the general design) constitutes a "genomic signature" in the 
sense that it is remarkably constant across different parts 
of a genome and is generally similar between related or- 
ganisms, but quite different between distantly related or- 
ganisms. This latter finding has led to application of this 
principle to phylogeny reconstruction and identification of 
laterally transferred genes in both prokaryotic and eukary- 
otic organisms. Although quite promising, this line of work 
is open to the criticism that there is no underlying theory for 
why a genome should possess a particular signature or the 
mechanism by which signatures change over long periods of 
time. The model described here may help to provide a the- 
oretical framework for interpreting these and related obser- 
vations. Our analysis may also contribute to understanding 
of the mechanisms of DNA mutation and repair and may 
help in the development of better methods for phylogenetic 
tree construction. 
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Possible factors relating to the genome signature include ef- 
fects related to DNA structure, base stacking and thermody- 
namics as well as a variety of mutational factors. Since the 
genome signature is particularly pronounced in non-coding 
sequences which are typically under much less selective pres- 
sure than coding regions and evolve much more rapidly, it 
appears likely that "nonselective" forces such as biases re- 
lated to mutation, replication and DNA repair account for 
many of the properties of the genome signature, especially 
in higher eukaryotic organisms with large genomes contain- 
ing mostly non-coding DNA and relatively small effective 
population sizes. One of the most common types of mu- 
tation in vertebrate genomes is CpG methylation followed 
by deamination and mutation of CG/CG to TG/CA |1] [Tfj] . 
For example, Wang et al.[20\ found that single-nucleotide 
polymorphisms occur approximately tenfold more often at 
CpG dinucleotides than at other dinucleotides in human ge- 
nomic DNA, suggesting that this is by far the most common 
type of single base mutation in humans. This is in agree- 
ment with the earlier finding by Batzer et. aZ.|5j and Hess 
et. al.^j\ that point substitutions occur 10 times more fre- 
quently at CpG sites compared to non-CpG sites. Increased 
mutation rate associated with CpG methylation can explain 
at least qualitatively why CpG dinucleotides have consistently 
extremely low relative abundance values in all vertebrate 
nuclear genomes [II . This effect was recently exploited by 
Fryxell and ZuckerkandhS) in their theory of the origin of 
genomic isochores in mammals. 

Here we consider a model for DNA sequence evolution which 
includes neighbor-dependent mutation effects. Efficient com- 
putational tools are constructed to solve the model for ar- 
bitrary user-specified mutation processes and rates. The 
main results are quantitative answers to (1) the long term 
effects of neighbor-dependent mutations on the base com- 
positional structure of a genome, and (2) the underlying 
mutation processes and rates given sequence data with din- 
ucleotide correlations. In principle, such a model should be 
able to account for the effects of CpG-methylation induced 
mutation and several other known contextual effects on mu- 
tation rates. However, in order to preserve the tractability 
of the model, certain other types of mutations which occur 
in nature had to be ignored. For example mutations which 
change the lengths of DNA sequences (insertions and dele- 
tions such as those caused by polymerase slippage) or more 
complex mutations such as inversions or other large-scale re- 
arrangements are not considered. Nevertheless, the current 
model is a reasonable first approximation to DNA sequence 
evolution in the absence of selection. 

This paper is organized as follows: In Sec. 2, we introduce 
our model of sequence evolution with neighbor-dependent 
mutation, and define the mutation rate matrices. In Sec. 
3, we describe the general scheme used to calculate the 
single and dinucleotide frequencies, and present the pat- 
tern of dinucleotide odds ratio obtained for a particular ex- 
ample involving the CpG-methylation-deamination process 
described above. We also present the backward analysis, 
where a maximum likelihood approach is used to infer the 
mutation rates from the observed dinucleotide frequencies 
within the confines of our model. A web server is made 
available for the public to perform these calculations at 
http://bioinfo.ucsd.edu/dinucleotides In Sec. 4, we 



list some of a large number of studies made possible by the 
method described here. Details of the method are relegated 
to the Appendix. 

2. THE MODEL 

We consider a sequence of L nucleotides ai, a.2, ■ ■ ■ , oll with 
OLi 6 {A, C,G,T}. The configuration at time t is denoted 
by d(t) — (ai (t), a2(t), . . . , ). There are two types 

of mutation processes allowed: (1) mutations of a single nu- 
cleotide ai — > a[ which occurs with a rate Q aa i independent 
of its neighbors, and (2) mutations of a pair of neighbor- 
ing nucleotides aiOi+i — > a' i a' i+1 which occurs with a rate 
R a pa'p' ■ These rates are positive numbers and fixed in time. 

We start with an initial random sequence a(0). The dynam- 
ics of the model is Markovian. At any time t, the sequence 
a(t) is updated in discrete time steps At/L according to the 
following update rules: 

1. A position i is chosen at random between 1 and L; 

2. The nucleotide ai from the sequence d(t) is mutated 
to a'i with 

r> / /i \ f 1 + Qa, ai ■ At for ai = a ■ 
P r K|a*) = [ Q ai< .At otherwise W 

to generate an intermediary sequence a'(t); 

3. A pair of neighboring positions j and j + 1 is chosen 
at random between 1 and L; 

4. The nucleotides a'ja'j +1 from the sequence a'(t) are 
mutated to a' :J 'a'J +1 with 

T. ( H It \ I I \ 

Pr(a i a i+1 |a J -a i+1 ) = 

! 1 + ^ Q H+i a H+i'f^ 
for a'j = a" and a'j +1 = a" +1 , . 
R ////»• Ai *• ' 

3 3+1 3 J+l 
otherwise 

to generate the new sequence a(t + At/L). 

To guarantee the conservation of the transition probabilities, 
the rates must be chosen such that Q aa = — ' Qaa' an d 
Raf3ap = —^2' a >piRai3a'i3'- The time increment At should 
be chosen such that all non-diagonal transition probabilities 
in Eqs. Q and (|5J are small (<<iC 1). Then after L such 
iterations, on average every base in the sequence d(t) is mu- 
tated with probability (Q + R) ■ At, corresponding to a net 
increment of time by At. The above procedure is then re- 
peated for a long time 1 until the stationary state is reached. 
Since the model is ergodic, the stationary state is unique. 
We denote the probability to find a configuration a in the 
stationary state by P(a). 

We refer to rates Q aa i and R a p a '0' collectively as the mu- 
tation matrix Q and R, respectively. There are a total of 12 
independent rates Q aa < and 240 independent rates R a p a i ■ 
Note that in the special case of neighbor-dependent single 
nucleotide mutation, e.g., the CpG-methylation-deamination 

1 Dynamical aspects of this evolution model will be discussed 
elsewhere 0. 
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process described in Sec. 1, R is given by a restricted form 
with R a g a ipi non-zero only if a = a' or j3 = ft'. There 
are still 96 independent rates under this restriction. For 
non-coding DNA sequences, mutations on the complemen- 
tary DNA strand should generally occur at the same rate 
as on the forward strand. We therefore expect a reverse 
complementary symmetry in the rates, i.e., Q a fs = Q^yj, 
and R a p-yS = Rp^s-' where a denotes the complement of 
a, e.g. A — T. These complementarity conditions reduce 
the number of independent parameters by another factor of 
two. Note however that for the calculation to be presented 
below, it is not necessary to impose any of these conditions 
on Q or R. 

3. METHODS AND RESULTS 

The subject of this study is the stationary probability dis- 
tribution P(a) for the mutation processes characterized by 
the rates Q and R. If each nucleotide position evolves inde- 
pendently of each other, i.e., if R = 0, then the distribution 
factorizes, with P(a) = FT. Po(ai), where Po{a) is given by 
the eigenvector corresponding to the largest eigenvalue of 
Q. However, for 0, the bases in the sequence do not 

evolve independently and the stationary distribution is diffi- 
cult to calculate. Instead, we focus on the single nucleotide 
frequencies (or the "base composition" ) and the dinucleotide 
frequencies, given by 

U = \ £ £ S lha P0) (3) 
i=i 

1 

f a P = tzt[ J2 J2 s ii«K+iP p (i)i ( 4 ) 

i — 1 y 

respectively. An important motivation for computing these 
frequencies is that they are easily measured from actual se- 
quence datajlO) and can thus be used for quantitative com- 
parison with the output of our model. 

3.1 Forward analysis: Computation of nu- 
cleotide frequencies 

Our first goal is to compute the frequencies f a and / a/ g in 
the stationary state for any set of mutation rates Q and 
R. Exact solution of these quantities is still difficult be- 
cause in order to compute the dinucleotide frequencies, one 
needs to know the trinucleotide frequencies f a p-y, etc., end- 
ing up with an infinite hierarchy of equations. This is a 
frequently encountered problem in coupled dynamical sys- 
tems. Here we introduce an approximation procedure called 
the "two-cluster approximation" which is well-known from 
nonlinear dynamics[3]. This procedure truncates the hierar- 
chy of equations, expressing the trinucleotide frequencies as 
a function of the single and dinucleotide frequencies, i.e., 

fap-y = faff f 0y / f/3 , (5) 

and then solves for the dinucleotide frequencies simultane- 
ously. The procedure gives the exact solution if the station- 
ary state of the mutation process R is a first-order Markov 
chain; it is generally very accurately as long as the sequence 
correlation in the stationary state is short-ranged. Compar- 
ison of solutions obtained using this method with the best 



numerical estimate from Monte-Carlo simulation yielded rel- 
ative disagreement well below the 1% level; see below. Thus, 
for practical purposes, we can regard the cluster approxima- 
tion as generating "exact" results. The advantage of using 
the cluster approximation (instead of Monte-Carlo simula- 
tion) is that the single and dinucleotide frequencies can be 
computed virtually instantaneously by numerically solving 
a set of algebraic equations for arbitrary mutation matrices 
Q and R, i.e., without any constraints on the rates. We 
have developed a web server at http : //bioinf o . ucsd . edu/ 
dinucleotides to perform this calculation. 

We will present our method here via a specific example, 
the CpG-methylation-deamination process described in the 
introduction. This process is described by a single neighbor- 
dependent mutation rate, i.e. 

-Rcgca = -Rcgtg = r (6) 

and Rc/3-yS ~ for all other nondiagonal entries. To make 
this example transparent, we also consider a simplified ver- 
sion of the neighbor-independent mutation rate Q, adopting 
a single rate q for all transitions and another rate p for all 
transversions, i.e., 

Qag = Qga = Qct = Qjc=q , (7) 

QaC = QcA = QAT = QtA = QcG = QgC = QgT = QlQ=P ■ (8) 

For this simple case, the single and dinucleotide frequen- 
cies can be solved analytically in closed form under the 
two-cluster approximation as described in the Appendix. 
Note that since the stationary state does not involve any 
time scale in itself, the results only depend on two effec- 
tive parameters, q/p and r/p. The transition/transversion 
ratio has been estimated to be q/p ~ 3 previously, based 
on mammalian pseudogene studies |7I l9l 1131 \'21] . Combined 
with the observed 10-fold difference^] |3j in point substitu- 
tion counts between CpG and non-CpG sites [which implies 
2(r + g) ~ 10 x (2p + q)], we obtain r/q » 20 for mam- 
mals. The nucleotide frequencies computed according to 
these rates are shown in Table □ They compare very well to 
the results of Monte-Carlo simulation: the latter performed 
for an L = 10 8 system (and averaged over 100 simulations) 
yielded results that were within an rms deviation of 10~ 5 
in the single-nucleotide frequencies and 4 ■ 10~ 5 in the din- 
ucleotide frequencies. 
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u 


fak 


faC 


/ctG 


fail 


A 


0.28606 


0.07475 


0.06119 


0.06827 


0.08182 


C 


0.21394 


0.08052 


0.05071 


0.01442 


0.06827 


G 


0.21394 


0.05625 


0.04577 


0.05071 


0.06119 


T 


0.28606 


0.07451 


0.05625 


0.08052 


0.07475 



Table 1: Single and dinucleotide frequencies from 
the two- cluster approximation for q/p — 3 and r/p = 
20. 

Let us examine the results presented in Table Q First we 
note that the nucleotide composition f a is skewed towards 
A and T, with a C+G content of 42%, which is in general 
agreement with the typical average C+G content in mam- 
mals. Note that unlike most existing phylogenetic studies 8 , 
where the mutation rates Q are tuned to reproduce the ob- 
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served frequencies. In our study, the mutation matrix Q 
by itself would have generated equal C+G and A+T content, 
with the anomaly coming solely from the CpG-methylation- 
deamination process. Similar observation was made in 
based on Monte Carlo simulations. 

Next, we analyze the dinucleotide frequencies. It is useful 
to focus on the dinucleotide odds ratio, p a/ s = f a g/(faf/3), 
introduced to indicate whether a specific dinucleotide pair 
a/3 is over- (p a a > 1) or under-represented (p a p < 1) with 
respect to neighbor-independent mutation. The results de- 
rived from the frequencies in Tables are shown in Table 
H(top). For comparison, the odds ratio p a g obtained from 
the observed single and dinucleotide frequencies f a and j a g 
for a region of 4Mbp integenic DNA taken from the Human 
genome Chromosome 21 (Accession # NT 011512) is given 
in Table |2| (bottom). [We will denote each observed quan- 
tity by a "hat" throughout the text.] The rms deviation 
between the observed ratio p a/ s and the computed peg's is 
0.1. We note that the computed ratios capture correctly 
the key feature of the observed data, i.e., a strong under- 
representation of the CG dinucleotides, compensated by the 
over-representation of CA and TG. 



Pa/3 
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1.00 


1.12 


1.00 


c 


1.32 


1.11 


0.32 


1.12 
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0.92 


1.00 


1.11 


1.00 
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0.91 


0.92 


1.32 


0.91 


Pap 


/3 = A 
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T 


a = A 


1.10 


0.87 


1.11 


0.91 


C 


1.20 


1.21 


0.20 


1.11 


G 


0.99 


1.05 


1.22 


0.87 


T 


0.80 


0.99 


1.21 


1.10 



Table 2: Top: Dinucleotide odds ratios of the model 
based on the single and dinucleotide frequencies in 
Table [2 Bottom: odds ratios obtained for a re- 
gion of 4Mbp integenic DNA taken from the Human 
genome Chromosome 21 (Accession # NT 011512). 

At this stage of the analysis, we cannot expect a perfect 
match between the observed and the computed frequencies 
in Table |21 since we used a very rough estimate on the rates 
q and r in the computation. One can of course always tune 
q and r to match the computed odds ratios in CG and CA/TG. 
But this is not sufficient for deducing the underlying muta- 
tion rates, since the CpG mutation process introduced also 
affects the other dinucleotide counts: As seen in Table |5] 
(top), our model produces 9 other dinucleotides that are 
over- or under- represented by ~ 10% each. They result 
from secondary mutation since all these 9 pairs are only one 
point mutation away from CG, CA, or TG. Only the four dinu- 
cleotides (AC, AT, GC, GT) that are two point mutations away 
are not affected by the CpG process. Note that the magnitude 
of the change in p a p due to the primary mutation, as well 
as the magnitude and sign of change in p a g due to the sec- 
ondary mutations could not have been anticipated directly 
from the model defined by Eqs. Q-(|HJ. Thus, the "back- 
ward analysis" of matching all of the dinucleotide ratios (as 
well as the single nucleotide frequencies) by varying the rates 
q and r is a highly nontrivial numerical task. Fryxell and 



Zuckerkandl0 attempted to do this by focusing on a few 
selected nucleotide frequencies (namely, CG, TG, TA and AT) 
which they obtained for different mutation rates using ex- 
tensive Monte Carlo simulation. Their method is arbitrary, 
inaccurate and very time consuming. With our approach, 
we can provide a systematic, accurate, and efficient method 
to perform the backward analysis as described below. Our 
method is implemented in a server at the same url as pro- 
vided above, http://bioinfo.ucsd.edu/dinucleotides . 

3.2 Backward analysis: Estimation of muta- 
tion rates 

Our next task is to estimate the mutation rates Q and R 
which best "explain" the observed data as quantified by 
single and dinucleotide frequencies f a and f a p. We will 
adopt the strategy of maximum likelihood|14). Specifically, 
we compute the likelihood £(Q,R) of observing the data 
given our mutation model with the rates Q and R: First we 
note that the probability P(a) of observing a sequence a of 
length L is 

L 

P(&) = f ai ...a L ~ JJ(/<»i_iOi//ai) 
i=2 

according to the two-cluster approximation (see Eq. and 
the Appendix). The likelihood of the occurrence of a par- 
ticular sequence with frequencies f a and f a g is then given 

by (lL/3 fap" P )/(H a ft la ) yielding the following expres- 
sion for the log-likelihood, 

log C = L Lb log(/ a/3 ) -L^/ a log(/a), (9) 

a/3 a 

where f a and f a g denote the single- and dinucleotide fre- 
quencies according to our model with rates Q and R. 

In principle, one can now search through all of the single- 
point mutation rates as embodied in Q and all possible com- 
binations of the neighbor-dependent mutation processes and 
their rates as embodied in R to maximize L. However, this 
search space is much too large even given our solution. More 
importantly, we do not want to have too many parameters to 
over-fit the data. Thus, we allow only a single transversion 
rate (which is set to 1 without loss of generality), two tran- 
sition rates, Qag = Qtc and Q C t = Qga (since a difference in 
those rates has already been reported in the literature|15|'). 
and one single neighbor-dependent mutation process. How- 
ever, we do not limit the latter to CpG and search through 
all of the 48 possibilities and their accompanying rates. If 
the maximum-likelihood solution found is still not satisfac- 
tory, then an additional neighbor-dependent process may be 
included to further improve the result. 

To illustrate the backward analysis, we feed in the single and 
dinucleotide counts of the above mentioned intergenic region 
of Human Chromosome 21. The best neighbor-dependent 
process found by our program is the known process CpG — ► 
CpA or CpG — > TpG. However, the transition rates correspond- 
ing to the maximum likelihood solution are at the bound- 
ary of the region searched. They are even smaller than the 
transversion rate p = 1, indicating that there is something 
wrong, most likely a sign that the data contained some fea- 
ture^) not anticipated by our mutation model. In order to 
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length 

Figure 1: The number of homonucleotide subse- 
quences of a given length in the above mentioned 
sample of 4Mbp intergenic DNA taken from the Hu- 
man Chromosome 21. The straight lines represent 
the expected distributions of such subsequences ac- 
cording to their observed single-nucleotide frequen- 
cies of bases and assuming no correlations between 
neighboring bases. 



get a hint at the source of the problem, we repeated the 
search allowing for an additional neighbor- dependent muta- 
tion process which would bring the transition rates to the 
expected regime. The program found the additional pro- 
cesses ApC — * ApA or GpT — > TpT; with these processes, the 
maximum-likelihood solution for the mutation rates became 
more reasonable, with Qag = Qtc = 3.10p, Qct = Qga = 
3.78 p, TicGCA = -Rcgtg = 43.02 p, and J?acaa = -Rgttt = 4.35 p. 
Since we are not aware of any biological or biochemical ev- 
idence for the additional mutation processes ApC — » ApA or 
GpT — > TpT, we interpret this result merely as an indication 
that the over-abundance of AA and TT in the observed data 
(see Table |5J makes it difficult to "explain" by the model 
with only the CpG mutation process. (From the top panel of 
Tabled it is clear that the CpG process only reduces the odds 
ratio of AA and TT, while the observed trend is the opposite.) 

What might be the source of this over-abundance? From an 
inspection of the length distribution of the homonucleotide 
subsequences (Fig. 0, one sees an conspicuous over-abun- 
dance of long runs of A's or T's. These long runs, while low 
in numbers, will clearly bias the dinucleotide counts. How- 
ever, their occurrences are thought to arise from sequence- 
specific insertion processes such as polymerase slippage and 
have nothing to do with neighbor-dependent mutation be- 
ing studied here. In order to perform the backward analysis 
properly, it is therefore necessary to first filter out the ef- 
fect of such processes. The odds ratio p( fllter ' obtained after 
removing all homonucleotide subsequences of length four or 
more are presented in Table [3] (top) . Comparing it to Ta- 
ble |21 (bottom), we see substantial (> 15%) decreases in the 
counts for AA, TT, CC, and GG as expected, as well as > 10% 
increases in AT and TA. (The single nucleotide frequencies 
are however not affected much by the filter.) Applying 
the backward analysis on the filtered data with a single 
neighbor-dependent process, we recover the known CpG pro- 



cess again, but this time with the rates Qag = Qtc = 1.75 p, 
Qct = Qga = 2.33p, and .Rcgca = Rcgtg = 22.17p, which 
are all within the known range given above. [Similar results 
were obtained when we filtered out subsequences of length 
5 or longer with at least 80% of the same nucleotide. The 
rates found were Qag = Qtc = 1.66 p, Qct = Qga = 2.03 p, 
and -Rcgca = ^cgtg = 17.80 p.] The predicted odds ratio p* 
corresponding to these maximum-likelihood rates are pre- 
sented in Table |3| (bottom). The rms deviation between the 
filtered data and the maximum-likelihood prediction is 0.05 . 
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1.02 
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0.90 


1.02 


1.26 
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a — A 


0.93 


1.00 


1.11 


1.00 
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1.30 


1.09 


0.24 


1.11 
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0.94 


1.00 


1.09 


1.00 
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0.91 


0.94 


1.30 


0.93 



Table 3: Top: Observed dinucleotide odds ratios 
of intergenic DNA taken from the Human genome 
Chromosome 21 (Accession # NT 011512) with all 
homonucleotide subsequences of four or more bases 
filtered out. Bottom: The odds ratio p* correspond- 
ing to the maximum-likelihood solution of the back- 
ward analysis. The rms deviation between the two 
odds ratios is 0.05 . 

4. DISCUSSION AND OUTLOOK 

We have introduced a model of neighbor-dependent muta- 
tion and developed methods to study these processes which 
are believed to play an important role in the evolution of ge- 
nomic sequences. We see that the base composition and the 
dinucleotide frequencies are strongly influenced by neighbor- 
dependent mutation. This may contribute to the (surpris- 
ing) success of phylogenetic analysis based on dinucleotide 
counts^^^!; since the pattern of dinucleotide frequencies 
reflect the mutation mechanisms and rates which should be 
highly conserved throughout evolution. A similar mutation 
model was investigated in Ref. using Monte Carlo sim- 
ulations. In that work, the backward analysis required a 
number arbitrary assumptions and large amounts of com- 
puter time. Our work represents a systematic approach to 
these studies. The analysis tools we provide enable the users 
to examine a large number of mutation processes and rates 
in detail and over a large number of genomes, to look for pre- 
viously unknown mutation processes and track them across 
the different kingdoms of life. 

Of course, the mutation model (i.e., Eqs. Q and J2J) and 
the assumption of stationarity for observed sequences need 
to be verified quantitatively. But the present method is in 
fact not limited to the study of the stationary state, and 
can be straightforwardly extended to describe the evolu- 
tion towards stationarity |T]. This can be used to extend 
the traditional phylogenetic analysis[H] to include neighbor- 
dependent effects, and should be particularly useful for the 
vertebrates which are dominated by the CpG-methylation- 
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deamination process. Finally we note that a more accurate 
description of the evolution of non-coding genomic DNA se- 
quences will be of value to various sequence analysis appli- 
cations, particularly those involving mammalian genomes. 
For instance, in the comparative genomics approach to gene 
and DNA motif finding, one needs to evaluate the proba- 
bility of spurious matches due to shared common ancestry 
in the non-coding regions 19 . Accurate evaluation of such 
probabilities depend critically on having realistic models of 
sequence evolution, which this study contributes towards. 
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APPENDIX 

A. THE TWO-CLUSTER APPROXIMATION 

To illustrate our method, we consider a specific example of 
the neighbor-dependent mutation process, the CpG-methy- 
lation-deamination process mentioned in the introduction. 
A similar model has also been studied using Monte-Carlo 
simulations in Ref. [5J. Given the mutation processes de- 
fined by Eqs. Q, Q an d O • the time evolution of the 
dinucleotide frequency f a p can be expressed in terms of the 
function f a p and the trinucleotide frequency / Q/ 3 7 only. It 
will be convenient to take the continuum time limit, and 
turn the discrete dynamics described in Sec. 2 into differen- 
tial equations^- For example, the evolution of fu is given 
by 

d 

t^/aa = p/ca + qfaA + p/ta - (2p + q)fu 

+p/ac + qfiG + p/at - (2p + q)fu + ^/cga (10) 

where all terms on the right hand side correspond to pro- 
cesses either creating (positive signs) or destroying (negative 
signs) an AA pair. In Eq. 1101 . the first four terms result from 
point mutations on the first site, the second four from point 
mutations on the second site, as indicated by the under- 
lined letters. The last term proportional to r stems from a 
G turning into an A due to a CG — > CA process on the two 
sites shifted by one to the left. There are 16 such equations 
for the 16 dinucleotide frequencies. 
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As one can see in Eq. I10L the time evolution of the two- 
point functions depends in general on the three-point func- 
tions fa/3-y which in turn will depend on the four-point func- 
tions. To truncate this hierarchy of equations, we apply a 
standard closure approximation used in non-linear dynam- 
ics 



/a/37 — 



(ii) 



i.e. we approximate the probability of finding three letters 
a/?7 by the probability of finding a pair a/3 multiplied by 
the conditional probability of finding a pair f3y, given that 
the middle base is /3; the latter is expressed according to 
Bayes's rule. The L-point function subsequently takes the 
form 



n 



foe i — i a i 

~1^T 



(12) 



The single nucleotide frequencies called for in Eqs. 1111 and 
ii 1 121 can be expressed in terms of the dinucleotide frequen- 
cies as 

fa faf3 ff3c- (13) 

P 

Thus, the right hand side of Eq. HUt can be written as a 
function of the / aj a's only. We can generate an equation 
similar to 11UI for each of the 16 f a p's. Applying the ap- 
proximation 1111 and the identity 1131 . we then obtain a 
closed system of coupled, nonlinear differential equations 



d_ 

dt 



fa0 — Gap(fkk, fkl 



i /tt) 



(14) 



with 16 Gap's. In the stationary state, the functions f a p(t) 
are independent of t and hence their derivative with re- 
spect to t vanishes. To calculate the stationary f a p's, we 
therefore have to solve the set of 16 (quadratic) equations, 
Gap(fkk, /ac, • • • , /tt) = 0, whose solution is straightforwardly 
obtained with the help of Mathematica and given below: 

It is convenient to express the results in term of the param- 
eter 



A = 



(3p + q)r 



16(p + q) (3p + q) + 4(7p + 3q)r ' 
The single nucleotide frequencies are 

4 2' 7 1 4 2 ' 



(15) 



(16) 



Since A is an increasing function of r with A(r = 0) = 0, 
we see that /a = /t > 1/4 for all positive r's. 

The dinucleotide frequencies are most succinctly expressed 
in terms of the auxiliary functions f a p = f a p — f a fp- The 
results are: 



A(l + A) , r(l-2A) 2 -16(p + g)A ^ 
/ CG = ~ I 1 'J 



4 ' JW 16r 

(2A - l)(4rA 2 + 8(2p + 2q + r)A - r 



.(18) 



32r(A - 1) 

Additionally, we have 

/ac = /at = foe = ,/gt = (19) 

since these four dinucleotides are two mutations away from 
the three primary pairs above, as motivated already in Sec.|S| 



The remaining 9 frequencies can now be obtained simply by 
exploiting the relation Yl a fa/3 = = Yl a ffia which follows 
from Eq. 1131 . and the reverse complementarity symmetry 
fa/3 = fps- The results are: 

/ct — ~{fck + fee + /cg) , /tc = — fee , /tt — — /ct, 
fkk ~ /tt , Jag = /ct , fak = /tc , faa ~ fee , /tg = fck, 
/ta = —(fkk + fck + fak)- 

As shown already in Sec. this approximation gives very 
accurate results when compared to Monte-Carlo simulations 
of the same system. This is due to the very short-ranged 
correlation induced by the mutation process and will not be 
elaborated here. 
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